In this notebook an integrated analysis is preformed.
Pair interaction data of miR - target(mRNA) and DNAm - target(mRNA) is an output from notebook 02-Regulation_targets.
Integrative analysis is goint go be focused on mRNA so a table with significantly DE and one of its regulator. Since there are some genes that have multiple regulators mir’s with the best correlation will be selected as a regulator. For DNAm !!! get back to this
To explore the influence of miR and DNAm on mRNA expression we will preform GSEA on groups of genes that are under regulation of DNAm, mir or both. Next, we will make an unsupervised clustering on genes with FC features to find patterns in gene expressions. (OVO ISTO POBOLJŠATI)
library(venn)
library(dplyr)
library(GenomicRanges)
library(gprofiler2)
library(DESeq2)
library(tibble)
library(kohonen)
source(file = file.path(getwd(), "scripts", "idMap.R"))
source(file = file.path(getwd(), "scripts", "integrateOmicsData.R"))
source(file = file.path(getwd(), "scripts", "kmeansHeatmap.R"))
# mRNA
mRNA_res <- readRDS("outputs/RDS/mRNA_res.rds")
# miR
mir_res <- readRDS("outputs/RDS/mir_res.rds")
# dnam
dnam_res_cgi <- readRDS("outputs/RDS/dnam_res_cgi.rds")
# miR-mRNA
targ_mir_mRNA <- readRDS("outputs/RDS/targ_mir_mRNA.rds")
cor_mir <- readRDS("outputs/RDS/cor_mir.rds")
# dnam-mRNA
targ_dnam_mRNA <- readRDS("outputs/RDS/targ_dnam_mRNA.rds")
In this chapter we build the integrated table that will contain multiple information from different data sets.
Firstly, filtering mRNA results to get only significantly DE genes.
mRNA_de <- filter(as.data.frame(mRNA_res),padj < 0.05)
Venn diagram of DE genes under different types of regulations
simple_tb <-
data.frame(genes=rownames(mRNA_de)) %>%
mutate(mir=dplyr::case_when(
is.element(genes,targ_mir_mRNA$target_ensembl) ~ 1, TRUE ~ 0)) %>%
mutate(dnam=dplyr::case_when(
is.element(genes, targ_dnam_mRNA$ensembl) ~ 1, TRUE ~ 0))
# table of genes under regulation
table(simple_tb [,c("mir","dnam")])
## dnam
## mir 0 1
## 0 683 574
## 1 430 842
venn::venn(x =list(simple_tb[simple_tb$dnam ==1,"genes"],
simple_tb[simple_tb$mir ==1,"genes"]),
zcolor = "style",
snames = c("DNA methylated","microRNA"))
GSEA analysis on simple_tb
To explore the influence of different types of regulation we preform gsea on crude groups of genes where the genes are presneted in the venn.
From the gostplot there are not many terms associated with genes that are ONLY under DNA methylation regulation (Only muscle contraction -REAC). In the intersection part there are terms that are involved in cell cycle regulation, regulation of transcription, .. Genes under mir regulation have many terms involved with the immune system, also some terms involved with cell mobility (cell-cell adhesion, cell migration, collagen degradation). The rest that are not under dnam or mir regulation have terms involved in muscle contractions and some immunological pathways. TF database added later, comment: In all groups there are many hits with TF.
# Building multiple queries
query_ls <-
list(dnam = filter(simple_tb, mir == 0 & dnam == 1 )$genes,
mir = filter(simple_tb, mir == 1 & dnam == 0 )$genes,
intersection = filter(simple_tb, mir == 1 & dnam == 1 )$genes,
rest = filter(simple_tb, mir == 0 & dnam == 0 )$genes)
# Sources
GSEA_SOURCES <- c("GO:BP","CORUM","KEGG","REAC","WP","MIRNA", "TF")
gost <-
gost(query = query_ls,
organism= "hsapiens",
exclude_iea=TRUE,
domain_scope="annotated",
ordered_query=F,
sources=GSEA_SOURCES)
# Plot and table of significant results
gostplot(gost, capped = T, interactive = T)
Building a complete and comprehensive table with results of DE and DM of the transcriptomic (mRNA), small-RNA transcriptomic (mir) and DNA methylation (CpG islands) to have all significant data in one data frame. The comprehensive table will contain all mir mRNA and dnam mRNA pairs. ! add colnames!
# this steps for this are described in the script
integ_df <-
integrateOmicsData(mrna_res_obj = mRNA_res,
mir_res_obj = mir_res,
dnam_res_obj = dnam_res_cgi,
mir_targets_corr = cor_mir )
## IDTABLE reading from filename: /home/katarina/Projects/HNSCC/scripts/IDTABLE_hg19.rds
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(input.id)` instead of `input.id` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(output.id)` instead of `output.id` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
message(paste0("Dimensions of integrated tbl: ", dim(integ_df)[1]))
## Dimensions of integrated tbl: 4888
# Perview of the table
head(integ_df)
## gene gene_mean gene_FC gene_padj symbol mir
## 1 ENSG00000000005 5.829884 -3.843846 0.0093059295 TNMD <NA>
## 2 ENSG00000001036 552.508147 1.230798 0.0177720796 FUCA2 <NA>
## 3 ENSG00000001460 174.721205 2.333851 0.0002877880 STPG1 <NA>
## 4 ENSG00000003756 1203.189778 -1.228833 0.0065340245 RBM5 hsa-miR-30a-5p
## 5 ENSG00000004487 1630.043481 1.632978 0.0288345519 KDM1A hsa-miR-708-5p
## 6 ENSG00000004776 1117.731720 -4.241982 0.0004821517 HSPB6 <NA>
## mir_mean mir_FC mir_padj mir_r dnam_condensed_ranges dnam_FC
## 1 NA NA NA NA <NA> NA
## 2 NA NA NA NA chr6:143832390-143832943 -0.2322518
## 3 NA NA NA NA chr1:24742672-24743039 -0.9747990
## 4 422.5344 -1.456531 0.044155321 0.19043996 chr3:50126253-50127204 -0.3585306
## 5 173.5683 1.913127 0.005732359 0.05489606 chr1:23345815-23346677 -0.1537775
## 6 NA NA NA NA chr19:36248980-36249307 -0.1495335
## dnam_num_sites dnam_padj
## 1 NA NA
## 2 5 0.13394874
## 3 2 0.03485222
## 4 8 0.11008094
## 5 7 0.25297703
## 6 4 0.12834187
# Summary statistics
summary(integ_df)
## gene gene_mean gene_FC gene_padj
## Length:4888 Min. : 2.14 Min. :-6.8618 Min. :0.000000
## Class :character 1st Qu.: 111.26 1st Qu.:-2.2847 1st Qu.:0.002563
## Mode :character Median : 393.01 Median :-1.0554 Median :0.011532
## Mean : 1473.79 Mean :-0.2176 Mean :0.016077
## 3rd Qu.: 1093.59 3rd Qu.: 1.8701 3rd Qu.:0.027061
## Max. :181417.54 Max. : 7.1293 Max. :0.049870
##
## symbol mir mir_mean mir_FC
## Length:4888 Length:4888 Min. : 2.77 Min. :-4.2468
## Class :character Class :character 1st Qu.: 101.52 1st Qu.:-1.3912
## Mode :character Mode :character Median : 362.61 Median : 1.4374
## Mean : 2648.69 Mean : 0.7136
## 3rd Qu.: 1240.17 3rd Qu.: 2.1971
## Max. :75885.30 Max. : 5.3210
## NA's :1284 NA's :1284
## mir_padj mir_r dnam_condensed_ranges dnam_FC
## Min. :0.0000 Min. :-0.8683 Length:4888 Min. :-2.6695
## 1st Qu.:0.0025 1st Qu.:-0.4130 Class :character 1st Qu.:-0.3908
## Median :0.0092 Median :-0.1380 Mode :character Median :-0.1573
## Mean :0.0148 Mean :-0.0140 Mean :-0.0613
## 3rd Qu.:0.0265 3rd Qu.: 0.4069 3rd Qu.: 0.0726
## Max. :0.0492 Max. : 0.9997 Max. : 3.9967
## NA's :1284 NA's :1284 NA's :1653
## dnam_num_sites dnam_padj
## Min. : 1.000 Min. :0.0000
## 1st Qu.: 5.000 1st Qu.:0.0333
## Median : 8.000 Median :0.0693
## Mean : 8.718 Mean :0.0950
## 3rd Qu.:11.000 3rd Qu.:0.1187
## Max. :60.000 Max. :0.9492
## NA's :1653 NA's :1653
Here we want to make a table that has only one row per mRNA so that we can explore the influence of DE mir and dnam that are putative regulators of said mRNA. For multiple mir regulators we choose the best correlating one. For multiple dm cgi’s involved in regulating an mRNA we choose the one with the “best” (larger absolute value) fold change.
reg_df <-
integ_df %>%
# Group by gene to compare their regulators
group_by(gene) %>%
# Storing the best correlating mir into best_cor column
# NA is for those that are not under mir regulation
mutate(best_cor=ifelse(!is.na(mir_r), mir[which.max(abs(mir_r))], NA)) %>%
# filtering out mirs that are not the best putative regulators
dplyr::filter(is.na(mir) | mir == best_cor) %>%
# Storing the cgi with max abs value in best_cgi
mutate(best_cgi= ifelse(!is.na(dnam_FC),
dnam_condensed_ranges[which.max(abs(dnam_FC))],
NA)) %>%
# Filtering out cgi that are not the best
dplyr::filter(is.na(dnam_condensed_ranges) |
dnam_condensed_ranges == best_cgi) %>%
# Dropping uneeded columns. This info is already in mir and dnam ranges
dplyr::select(-best_cor, -best_cgi) %>%
ungroup()
reg_df
## # A tibble: 2,529 × 14
## gene gene_mean gene_FC gene_padj symbol mir mir_mean mir_FC mir_padj
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG000… 5.83 -3.84 0.00931 TNMD <NA> NA NA NA
## 2 ENSG000… 553. 1.23 0.0178 FUCA2 <NA> NA NA NA
## 3 ENSG000… 175. 2.33 0.000288 STPG1 <NA> NA NA NA
## 4 ENSG000… 1203. -1.23 0.00653 RBM5 hsa-mi… 423. -1.46 0.0442
## 5 ENSG000… 1630. 1.63 0.0288 KDM1A hsa-mi… 174. 1.91 0.00573
## 6 ENSG000… 1118. -4.24 0.000482 HSPB6 <NA> NA NA NA
## 7 ENSG000… 637. -3.38 0.00103 PDK4 hsa-mi… 1611. -0.920 0.00363
## 8 ENSG000… 44.5 -1.28 0.0229 ZMYND10 <NA> NA NA NA
## 9 ENSG000… 8.89 -4.52 0.000735 ARX <NA> NA NA NA
## 10 ENSG000… 7.28 3.43 0.0220 HOXA11 hsa-mi… 232. 1.01 0.0184
## # … with 2,519 more rows, and 5 more variables: mir_r <dbl>,
## # dnam_condensed_ranges <chr>, dnam_FC <dbl>, dnam_num_sites <dbl>,
## # dnam_padj <dbl>
Using two unsupervised clustering methods SOM and kmeans to explore clusters of genes under different combination of regulations. Features are fold changes of mRNA and mir data and quotient of beta intensity value for dna methylation. We hypothesized that groups with mRNA FC values will be opposite of mir and dnam regulators which would suggest that those groups are under regulation of mir and/or dnam.
K means clusters mRNA genes into k number of clusters in which each mRNA belongs to the cluster with the nearest mean (or cluster centroid). For more details look at the lecture materials from pmf machine learning.
# Preparing data for kmeans
reg_df
## # A tibble: 2,529 × 14
## gene gene_mean gene_FC gene_padj symbol mir mir_mean mir_FC mir_padj
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG000… 5.83 -3.84 0.00931 TNMD <NA> NA NA NA
## 2 ENSG000… 553. 1.23 0.0178 FUCA2 <NA> NA NA NA
## 3 ENSG000… 175. 2.33 0.000288 STPG1 <NA> NA NA NA
## 4 ENSG000… 1203. -1.23 0.00653 RBM5 hsa-mi… 423. -1.46 0.0442
## 5 ENSG000… 1630. 1.63 0.0288 KDM1A hsa-mi… 174. 1.91 0.00573
## 6 ENSG000… 1118. -4.24 0.000482 HSPB6 <NA> NA NA NA
## 7 ENSG000… 637. -3.38 0.00103 PDK4 hsa-mi… 1611. -0.920 0.00363
## 8 ENSG000… 44.5 -1.28 0.0229 ZMYND10 <NA> NA NA NA
## 9 ENSG000… 8.89 -4.52 0.000735 ARX <NA> NA NA NA
## 10 ENSG000… 7.28 3.43 0.0220 HOXA11 hsa-mi… 232. 1.01 0.0184
## # … with 2,519 more rows, and 5 more variables: mir_r <dbl>,
## # dnam_condensed_ranges <chr>, dnam_FC <dbl>, dnam_num_sites <dbl>,
## # dnam_padj <dbl>
k_tb <- data.frame(row.names = reg_df$gene,
dplyr::select(reg_df,
gene_FC,
mir_FC,
dnam_FC))
k_tb$dnam_FC <- k_tb$dnam_FC*2
k_tb[is.na(k_tb)] <- 0
annotation_tb <-
data.frame(row.names = reg_df$gene,
dnam_num_sites = log2(reg_df$dnam_num_sites),
gene_mean = log2(reg_df$gene_mean))
kmeansHeatmap(tb=k_tb,
ann_tb = annotation_tb,
clusters = 10:12,
nstart = 55,
GSEA = TRUE)
## Processing kmeans with 10 clusters.
## Ploting GO enrichment for: 10 clusters
## Processing kmeans with 11 clusters.
## Ploting GO enrichment for: 11 clusters
## Processing kmeans with 12 clusters.
## Ploting GO enrichment for: 12 clusters
SOM - Self Organizing maps. Purpose is to map data to lower dimensional space while conserving the distance between the samples as much as possible. SOM tool is used when the visualization is important. The size of the grid is predefined and it is chosen so that there are not many empty nodes. Check which nodes are activated by position and value
# Preparing data frame for SOM
som_tb <- data.frame(row.names = reg_df$gene,
dplyr::select(reg_df,
gene_FC,
mir_FC,
dnam_FC))
som_tb[is.na(som_tb)] <- 0
# Training SOM model
## Firstly define the SOM grid
set.seed(100)
som_grid <- somgrid(xdim=7, ydim=7, topo="hexagonal")
## SOM model
som_model <- som(as.matrix(som_tb),
som_grid,
rlen=500,
radius=2.5,
keep.data=TRUE,
dist.fcts="euclidean")
Visualizations:
plot(som_model, type = "codes", main = "Codes Plot", palette.name = topo.colors)
plot(som_model, type = "mapping", pchs=19, shape ="round")
plot(som_model, type = "counts", palette.name= rainbow)
Saving som visualizations
timestamp <- Sys.time()
timestamp <- format(timestamp, "%Y%m%d_%H%M")
filename <- "outputs/som_clustering/"
png(file = paste0(filename, timestamp, "_som_codes.png" ))
plot(som_model, type = "codes", main = "Codes Plot", palette.name = topo.colors)
dev.off()
## png
## 2
png(file = paste0(filename, timestamp, "_som_mapping.png" ))
plot(som_model, type = "mapping", pchs=19, shape ="round")
dev.off()
## png
## 2
png(file = paste0(filename, timestamp, "_som_counts.png" ))
plot(som_model, type = "counts", palette.name= rainbow)
dev.off()
## png
## 2
GSEA
Gene set enrichment on nodes of the grid.
In the next chunk we perform GSEA to explore pathways enriched by certain nodes. Nodes that enrich similar pathways are close together. Closer nodes have similar distribution of values across data sets in example they have similar pattern of expression and their mir and dnam regulation effects are similar.
query <- data.frame(row.names = rownames(som_model$data[[1]]), class = som_model$unit.classif)
query <- tibble::rownames_to_column(query, var = "genes")
query <- with(query, split(genes,class))
g2 <-
gost(query = query,
organism = "hsapiens",
exclude_iea = TRUE,
domain_scope = "annotated")
# Number of genes in each node
table(som_model$unit.classif)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19 20 21
## 25 39 113 38 20 17 27 11 29 72 54 67 36 39 6 16 158 89 36 40
## 22 23 24 25 26 27 28 29 30 31 33 34 35 36 37 38 39 40 41 42
## 38 105 18 139 130 27 212 26 44 48 64 176 120 17 11 51 72 51 35 57
## 43 44 46 47 48 49
## 44 17 22 26 13 34
# Number of terms for each node arranged by frequency
dplyr::arrange(as.data.frame(table(g2$result$query)), desc(Freq))
## Var1 Freq
## 1 3 249
## 2 30 80
## 3 42 79
## 4 11 50
## 5 35 45
## 6 25 44
## 7 49 43
## 8 22 42
## 9 47 34
## 10 29 27
## 11 34 25
## 12 33 23
## 13 23 22
## 14 40 22
## 15 2 20
## 16 28 17
## 17 14 15
## 18 39 15
## 19 20 14
## 20 36 14
## 21 6 14
## 22 1 13
## 23 17 12
## 24 38 12
## 25 15 11
## 26 10 10
## 27 48 10
## 28 12 9
## 29 31 8
## 30 13 7
## 31 21 7
## 32 8 7
## 33 9 7
## 34 19 6
## 35 37 6
## 36 5 6
## 37 41 5
## 38 46 5
## 39 7 5
## 40 4 4
## 41 16 2
## 42 43 2
## 43 44 2
## 44 24 1
## 45 26 1
writexl::write_xlsx(g2$result, path = paste0(filename,timestamp,"_GSEA.xlsx"))
Analyzing the results of SOM clustering an GSEA enrichment. Cluster 3 has the most terms, highlighted terms are involved in cell cycle regulation.
head(dplyr::filter(dplyr::select(as.data.frame(g2$result),
query, source, term_name, p_value, recall),
query == 3 ))
## query source term_name
## 1 3 GO:BP carbohydrate derivative biosynthetic process
## 2 3 GO:BP anaphase-promoting complex-dependent catabolic process
## 3 3 GO:BP response to abiotic stimulus
## 4 3 GO:BP carbohydrate derivative metabolic process
## 5 3 GO:BP pre-replicative complex assembly
## 6 3 GO:BP cell cycle process
## p_value recall
## 1 0.00004811922 0.03000000
## 2 0.00008878841 0.09411765
## 3 0.00024574416 0.02427184
## 4 0.00026561854 0.02213280
## 5 0.00027004480 0.10606061
## 6 0.00033637506 0.01959248
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=hr_HR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=hr_HR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=hr_HR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=hr_HR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] plotly_4.10.0 ggplot2_3.3.5
## [3] htmlwidgets_1.5.4 pheatmap_1.0.12
## [5] RColorBrewer_1.1-2 biomaRt_2.48.3
## [7] kohonen_3.0.10 tibble_3.1.5
## [9] DESeq2_1.32.0 SummarizedExperiment_1.22.0
## [11] Biobase_2.52.0 MatrixGenerics_1.4.3
## [13] matrixStats_0.61.0 gprofiler2_0.2.1
## [15] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
## [17] IRanges_2.26.0 S4Vectors_0.30.2
## [19] BiocGenerics_0.38.0 dplyr_1.0.7
## [21] venn_1.10
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2
## [4] progress_1.2.2 httr_1.4.2 tools_4.1.2
## [7] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
## [10] DBI_1.1.1 lazyeval_0.2.2 colorspace_2.0-2
## [13] withr_2.4.2 tidyselect_1.1.1 prettyunits_1.1.1
## [16] curl_4.3.2 bit_4.0.4 compiler_4.1.2
## [19] cli_3.1.0 xml2_1.3.2 DelayedArray_0.18.0
## [22] labeling_0.4.2 sass_0.4.0 scales_1.1.1
## [25] genefilter_1.74.1 rappdirs_0.3.3 stringr_1.4.0
## [28] digest_0.6.28 rmarkdown_2.11 XVector_0.32.0
## [31] pkgconfig_2.0.3 htmltools_0.5.2 highr_0.9
## [34] dbplyr_2.1.1 fastmap_1.1.0 rlang_0.4.12
## [37] rstudioapi_0.13 RSQLite_2.2.8 farver_2.1.0
## [40] jquerylib_0.1.4 generics_0.1.1 jsonlite_1.7.2
## [43] crosstalk_1.1.1 BiocParallel_1.26.2 RCurl_1.98-1.5
## [46] magrittr_2.0.1 GenomeInfoDbData_1.2.6 Matrix_1.3-4
## [49] Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0
## [52] lifecycle_1.0.1 stringi_1.7.5 yaml_2.2.1
## [55] zlibbioc_1.38.0 BiocFileCache_2.0.0 grid_4.1.2
## [58] blob_1.2.2 crayon_1.4.1 lattice_0.20-45
## [61] Biostrings_2.60.2 splines_4.1.2 annotate_1.70.0
## [64] hms_1.1.1 KEGGREST_1.32.0 locfit_1.5-9.4
## [67] knitr_1.36 pillar_1.6.4 geneplotter_1.70.0
## [70] admisc_0.19 XML_3.99-0.8 glue_1.4.2
## [73] evaluate_0.14 data.table_1.14.2 vctrs_0.3.8
## [76] png_0.1-7 gtable_0.3.0 purrr_0.3.4
## [79] tidyr_1.1.4 assertthat_0.2.1 cachem_1.0.6
## [82] xfun_0.27 xtable_1.8-4 survival_3.2-13
## [85] viridisLite_0.4.0 AnnotationDbi_1.54.1 memoise_2.0.0
## [88] writexl_1.4.0 ellipsis_0.3.2